Phillip (Armand) Bester is a medical scientist, researcher, and lecturer at the Division of Virology, University of the Free State, and National Health Laboratory Service (NHLS), Bloemfontein, South Africa

Andrie de Vries is the author of “R for Dummies” and a Solutions Engineer at RStudio

Introduction

In part 2 of this series, we discussed the PhyloPi pipeline for conducting routine HIV phylogenetics in the drug resistance testing laboratory as a part of quality control. As mentioned, during HIV replication the error-prone viral reverse transcriptase (RT) converts its RNA genome into DNA before it can be integrated into the host cell genome. During this conversion, the enzyme makes random mistakes in the copying process. These mistakes or mutations can be deleterious, beneficial or may have no measurable impact on the replicative fitness of the virus. This fast rate of mutation provides enough divergence to be useful for phylogenetic analysis. As we will see in the 4th part of this series, the intrapatient divergence is smaller than the interpatient divergence. As infections spread from person to person the virus will continue to mutate and become more and more divergent. This allows us to use the genetic information we obtain while doing the drug resistance test and analyse the sequences for abnormalities.

Cuevas et al. (2015) published work on the in vivo rate of HIV evolution. Their analysis revealed the highest mutation rate of any biological entity of \(4.1 \cdot 10^{-3} (sd=1.7 \cdot 10^{-3})\). However, error-prone RT is not the only mechanism of mutation. One defence against HIV infection is an enzyme called apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like or APOBEC. As the name suggests these are enzymes which act on RNA and converts or mutates cytidine to uridine. In other words C to U which is the impact on RNA or C to T with DNA. The letter U in RNA is synonymous to T in DNA. Also, shown by Cuevas et al, these enzymes are not equally active in all people.

So how does this affect the virus in a negative way?

We first need to understand how RNA is translated into proteins. Below is a table showing the codon combinations for each of the 20 amino acids.

As can be seen from the table above, some amino acids are encoded by more than one codon. For example, if we change the codon, CGU to AGA the resulting amino acid stays Arginine or R. This is referred to as a silent mutation since the resulting protein will look the same. On the other hand if we mutate AGU to CGU the resulting mutation is from Serine to Arginine or in sigle letter notation, S to R. A change in the amino acid is refered to as a non-synonimous mutation. In reality the APOBEC enzyme recognizes specific RNA sequence motifs, but just to give and idea of how this works, lets look at an example.

Load some packages

library(ape)
library(tibble)
library(tidyr)
library(dplyr)
library(knitr)
library(plotly)
library(RColorBrewer)
WT <- c("CGA", "GUU", "AUA", "GAG", "UGG", "AGU")

We have the sequence CGAGUUAUAGAGUGGAGU which we created in the cell block above as codons for clarity. We can now translate this seqeunce using the codon table or some function.

AA <- trans(as.DNAbin.DNAString(gsub("U", x = paste0(WT, collapse = ""), replacement = "T" )))

AA <- as.character.AAbin(AA)[[1]]

The code block above translated our RNA sequence into a protein sequence: R, V, I, E, W, S

Now lets mutate all occurances of C to U/T

MUT <- gsub("C", "U", WT)

The resulting mutant sequenc is: UGA, GUU, AUA, GAG, UGG, AGU and if we now translate that we get

AA <- trans(as.DNAbin.DNAString(gsub("U", x = paste0(MUT, collapse = ""), replacement = "T" )))

AA <- as.character.AAbin(AA)[[1]]

the protein sequence: *, V, I, E, W, S.

The * means a stop codon was introduced. Stop codons are responcible for terminating translation from RNA to protein. If one of the viral genes has a stop codon in it the protein will truncate will be disfuctional.

Calculating genetic distances from a multiple sequence alingment (msa)

In the previous seqtion we showed the general principle of a msa. In biology sequence alingments are used to look at similarties of DNA or protein sequences. For most phylogenetic analysis a mutliple sequence alingment is a requirement and the more accurate the msa, the more accurate the phylogenetic inference. There are different ways of displaying a msa. Lets explore some of these options.

Read in the multiple sequence alignment file

# Read in the alignment file
aln <- read.dna('example.aln', format = 'fasta')
# Calculate the genetic distances between sequences using the K80 model, as.mattrix makes the rest easier
alnDist <- dist.dna(aln, model = "K80", as.matrix = TRUE)

Reduction of heatmap to focus on the important data

The pipeline mentioned uses the Basic Local Alignment Search Tool (BLAST) to retrieve previously sampled sequences and adds these retrieved sequences to the analysis. BLAST is like a search engine you use on the web, but for protein or DNA sequences. By doing this important sequences from retrospective samples are included which enables PhyloPi to be aware of past sequences and not just batch per batch aware. Have a look at the paper for some examples.

The data we have is ready to use for heatmap plotting purposes, but since the data also contains previously sampled sequences, comparing those sequences amongst themselves would be a distraction. We are interested in those samples, but only compared to the current batch of samples analysed. The figures below should explain this a bit better.


A diagram of a heatmap with lots of redundant and distracting data.

A diagram of a heatmap with lots of redundant and distracting data.


From the image above you can see that, typical of a heatmap, it is symmetrical on the diagonal. We show submitted vs retrieved samples in both the horizontal and vertical direction. Notice also, as annotated as “Distraction”, are the previous samples compared amongst themselves. We are not interested in those sample now as we would already have acted on any issues then. What we want instead, is a heatmap as depicted in the image below.


A diagram of a more focussed heatmap with the redundant and distracting data removed.

A diagram of a more focussed heatmap with the redundant and distracting data removed.


Luckily for us, we have a very powerful tool to our disposal, R, and plenty of really useful and convenient packages, like dplyr.

alnDistLong <- 
  alnDist %>% 
  as.data.frame(stringsToFactors = FALSE) %>% 
  rownames_to_column(var = "sample_1") %>% 
  gather(key = "sample_2", value = "distance", -sample_1, na.rm = TRUE) %>% 
  arrange(distance)

Create a new variable, combined, we will paste the names for sample_1 and sample_2 together

Final cleanup and removal of distracting data

# get the names of samples originally in the fasta file used for submission
qSample <- names(read.dna("example.fasta", format = "fasta"))

# compute new order of samples, so the new alignment is in the order of the heatmap example
sample_1 <- unique(alnDistLong$sample_1)
new_order <- c(sort(qSample), setdiff(sample_1, qSample))

Plot the heatmap using plotly for interactivity

alnDistLong %>% 
  filter(
    sample_1 %in% qSample,
    sample_1 != sample_2
    ) %>% 
  mutate(
    sample_2 = factor(sample_2, levels = new_order)
  ) %>% 
  plot_ly(
    x = ~sample_2,
    y = ~sample_1,
    z = ~distance,
    type = "heatmap", colors = brewer.pal(11, "RdYlBu"), 
    zmin = 0.0, zmax = 0.03,  xgap = 2, ygap = 1
) %>% 
  layout(
    margin = list(l = 100, r = 10, b = 100, t = 10, pad = 4), 
    yaxis = list(tickfont = list(size = 10), showspikes = TRUE),
    xaxis = list(tickfont = list(size = 10), showspikes = TRUE)
  )

Phylogenetic tree

Above we used the package ape to calculate the genetic distances for the heatmap.

Another way of looking at our alignment data is to use phylogenetic inference. The PhyloPi pipeline saves each step of phylogenetic inference to allow the user to intercept at any step. We can use the newick tree file (a text file formatted as newick) and draw our own tree.

tree <- read.tree("example-tree.txt")
plot.phylo(
  tree, cex = 0.8, 
  use.edge.length = TRUE, 
  tip.color = 'blue', 
  align.tip.label = FALSE, 
  show.node.label = TRUE
)
nodelabels("This one", 9, frame = "r", bg = "red", adj = c(-8.2,-46))

We have highlighted a node with a red block, with the text “This one” which we can now discuss. We have three leaves in this node, KM050043, KM050042, KM050041 and if you would look up these accession numbers at NCBI you will notice the Title of the paper it is tied to:

“HIV transmission. Selection bias at the heterosexual HIV-1 transmission bottleneck”

In this paper, the authors looked and selection bias when the infection is transmitted. They found that in a pool of viral quasispecies transmission is biased to benefit the fittest viral quasispecie. The node highlighted above shows the kind of clustering one would expect with a study like the one mentioned above. You will also notice plenty of other nodes which you can explore using the accession number and searching for it at https://www.hiv.lanl.gov/components/sequence/HIV/search/search.html